Modern Pathology
Elsevier BV
Preprints posted in the last 90 days, ranked by how well they match Modern Pathology's content profile, based on 10 papers previously published here. The average preprint has a 0.07% match score for this journal, so anything above that is already an above-average fit.
Alabtah, G.; Alsaafin, A.; Alfasly, S.; Shafique, A.; Hemati, S.; Choudhary, A.; Ravishankar, I. K.; DiCaudo, D.; Nelson, S. A.; Stockard, A.; Leibovit-Reiben, Z.; Zhang, N.; Kalari, K.; Murphree, D.; Mangold, A.; Comfere, N.; Tizhoosh, H. R.
Cutaneous squamous cell carcinoma (cSCC) poses significant clinical challenges due to its rising incidence and potential for metastasis. Histopathologic risk stratification is further limited by substantial inter-observer variability. Unsupervised AI approaches based on content-based image retrieval offer scalable and interpretable decision support for diagnostic pathology. The objective of this study was to evaluate the use of image retrieval within histopathology atlases to stratify cSCC tumor differentiation from whole-slide images (WSIs), while comparing different patch selection and feature extraction strategies. This retrospective study included 552 archived WSIs comprising 385 well-differentiated, 102 moderately differentiated, and 66 poorly differentiated cases collected across Mayo Clinic sites in Arizona, Florida, and Minnesota. Image atlases were constructed using multiple patch aggregation strategies (Mosaic, Collage, and Montage) and deep learning encoders (KimiaNet, PathDino, and H-Optimus-0). A leave-one-WSI-out evaluation framework was used to assess differentiation classification performance using accuracy, specificity, sensitivity, and F1 score. Mosaic combined with KimiaNet achieved the highest Top-1 accuracy (74.9%) and specificity (92.6%), while Mosaic with H-Optimus-0 yielded the best Top-5 accuracy (79.0%) and macro-F1 score (62.6%). Collage combined with KimiaNet produced the highest Top-5 specificity (99.5%). The generalizability of the evaluated AI models varied across hospitals, reflecting differences in imaging protocols, staining practices, and patient populations. Overall, unsupervised image search and retrieval provides effective, annotation-free support for cSCC differentiation and has the potential to enhance dermatopathology workflows when appropriate combinations of patch selection and feature extraction methods are employed.
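The leave-one-WSI-out protocol described above is straightforward to sketch: each slide-level embedding queries all remaining slides, and Top-1/Top-k accuracy is scored from the retrieved labels. The snippet below is a minimal toy illustration, not the paper's pipeline (which builds Mosaic/Collage/Montage atlases from patch features); the `leave_one_out_topk` helper and the Euclidean-distance retrieval are assumptions for the sketch.

```python
import numpy as np

def leave_one_out_topk(features, labels, k=5):
    """Leave-one-slide-out retrieval: each slide queries all others;
    Top-1 scores the nearest neighbour's label, Top-k scores a hit if
    the true label appears among the k nearest neighbours."""
    X = np.asarray(features, dtype=float)
    # pairwise squared Euclidean distances between slide embeddings
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)          # exclude the query slide itself
    top1 = topk = 0
    for i, y in enumerate(labels):
        nn = np.argsort(d2[i])[:k]        # indices of the k nearest slides
        retrieved = [labels[j] for j in nn]
        top1 += retrieved[0] == y
        topk += y in retrieved
    n = len(labels)
    return top1 / n, topk / n

# toy example: two well-separated differentiation "classes"
rng = np.random.default_rng(0)
feats = np.vstack([rng.normal(0.0, 0.1, (6, 8)),
                   rng.normal(3.0, 0.1, (6, 8))])
labs = ["well"] * 6 + ["poor"] * 6
acc1, acc5 = leave_one_out_topk(feats, labs, k=5)
```

With clearly separated toy clusters, both accuracies are 1.0; on real WSI embeddings the two numbers diverge, which is exactly the Top-1 vs. Top-5 gap the study reports.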
Littlefield, N.; Bao, R.; Xia, R.; Gu, Q.
Image classification on digital pathology images relies heavily on convolutional neural networks (CNNs), yet the behavior of alternative neural computing paradigms in this domain remains insufficiently characterized. Spiking neural networks (SNNs), which process information through event-driven spike-based dynamics, have recently become trainable at scale but have not been evaluated under standardized colorectal pathology benchmarks. This study presents the first controlled comparison of SNNs and CNNs on the Minimalist Histopathology Image Analysis (MHIST) dataset, a widely used, publicly available benchmark released by Dartmouth-Hitchcock Medical Center and designed for reproducible evaluation of histopathology classification models. The classification task focuses on the clinically important binary distinction between hyperplastic polyps (HPs) and sessile serrated adenomas (SSAs), a challenging problem characterized by substantial inter-pathologist variability, where HPs are typically benign and SSAs represent precancerous lesions requiring closer clinical follow-up. Histologically, HPs exhibit superficial serrated architecture and elongated crypts, whereas SSAs are characterized by broad-based, often complex crypt structures with pronounced serration. A conventional ResNet-18 architecture and its spiking counterpart are evaluated under matched training and inference to isolate the effect of spiking computation. Model performance is quantified using the area under the receiver operating characteristic curve (ROC-AUC), yielding 0.817 for the conventional CNN and 0.812 for the SNN. This comparison enables a direct assessment of how spiking computation influences discriminative performance in HP versus SSA binary classification and provides a benchmark reference for SNNs on the MHIST dataset. The code is publicly available at https://github.com/qug125/snn-crcp.
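The ROC-AUC values compared above (0.817 vs. 0.812) have a simple probabilistic reading: AUC is the probability that a randomly chosen positive scores higher than a randomly chosen negative (the Mann-Whitney statistic). A self-contained computation on made-up scores, not the paper's model outputs:

```python
def roc_auc(y_true, scores):
    """ROC-AUC via the Mann-Whitney U statistic: fraction of
    positive/negative pairs where the positive outranks the negative
    (ties count half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# hypothetical sigmoid outputs for 3 SSAs (label 1) and 3 HPs (label 0)
y = [0, 0, 0, 1, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2]
auc = roc_auc(y, scores)   # 7 of 9 pairs ranked correctly
```

Evaluating the CNN and the SNN with the same function on the same held-out set is what makes the 0.005 gap in the study directly interpretable.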
Spirgath, K.; Huang, B.; Safraou, Y.; Kraftberger, M.; Dahami, M.; Kiehl, R.; Stockburger, C. H. F.; Bayerl, C.; Ludwig, J.; Jaitner, N.; Kühl, A.; Asbach, P.; Geisel, D.; Hillebrandt, K. H.; Wells, R. G.; Sack, I.; Tzschätzsch, H.
Background & Aims: The increasing global prevalence of metabolic dysfunction-associated steatotic liver disease (MASLD), including metabolic dysfunction-associated steatohepatitis (MASH), creates an urgent need for objective methods of histopathological assessment. Conventional histological approaches are time-consuming and rely on the interpreter's experience; the results obtained may therefore suffer from high variability and offer only coarse categorisation. In this study, we propose a fully automated, deep-learning-based pipeline for the segmentation and characterisation of histological liver features for MASH/MASLD assessment.

Methods: Segmentation was applied to H&E sections from 45 mice and 44 humans with MASH/MASLD. The method, which we named qHisto (quantitative histology), utilises the nnU-Net framework and quantifies key histological components of the MASH score, including macro- and microvesicular steatosis, fibrosis, inflammation, hepatocellular ballooning and glycogenated nuclei. Additionally, we characterized the tissue using novel features that are inaccessible through manual histology, such as the distribution of fat droplet sizes, the aspect ratio of nuclei and heatmaps.

Results: qHisto parameters showed strong positive correlations with conventional histology scores (fat area R=0.91, inflammation density R=0.7, ballooning density R=0.49) and with quantitative magnetic resonance imaging (fat area vs. hepatic fat fraction R=0.87). Our novel scores showed that deformation of nuclei is driven by large fat droplets rather than by the overall amount of fat.

Conclusions: A key advantage of our method is spatially resolved, precise histological quantification. These features provide a more finely resolved assessment of disease severity than conventional categorical scoring. By automating time-consuming and repetitive readouts, qHisto improves the standardisation and reproducibility of MASH/MASLD feature quantification and provides scalable, slide-wide readouts that can support histopathologists and enhance clinical assessment and therapeutic development.

Impact and Implications: The proposed method provides an objective, automatic tool for comprehensive histological liver analysis of MASH/MASLD, which can be extended to other diseases and organs. By offering classic and novel quantitative parameters and scores, our method could support histologists in their daily routines and provide researchers with further insight into steatotic liver diseases.
Windell, D.; Magness, A.; Li, R.; Davis, T.; Thomaides Brears, H.; Larkin, S.; Beyer, C.; Aljabar, P.; Kainth, R.; Wakefield, P.; Langford, C.; Powell, N.; DeLegge, M.; Bateman, A. C.; Feakins, R.; Fryer, E.; Goldin, R.; Landy, J.
Background and Aims: Artificial intelligence (AI) is increasingly applied to histological assessment in inflammatory bowel disease (IBD), but most approaches quantify features in isolation and ignore their anatomical location within the mucosa. We developed and validated PAIR-IBD (Perspectum AI Reading in IBD), an AI system that quantifies inflammatory cell populations, crypt injury, and epithelial damage within defined mucosal compartments to distinguish active disease, remission, and equivocal cases in ulcerative colitis (UC).

Methods: A deep learning ensemble was trained on three IBD biopsy datasets to identify lymphocytes, neutrophils, eosinophils, and plasma cells, and to segment crypts, lamina propria (LP), and muscularis mucosae. Inflammatory cell densities and crypt injury metrics (mucin depletion, solidity, roughness, branching, and abscess formation) were quantified. PAIR-IBD outputs were compared between histologically active and remissive UC, evaluated in inconclusive cases, and correlated with manual pathology grading.

Results: Neutrophil density increased 3.5-fold in the LP and 15-fold within crypts in active UC (p<0.0001). Eosinophil density doubled and LP lymphocytes increased 1.4-fold. Active UC showed increased mucin depletion, crypt branching, and crypt abscesses, with reduced crypt solidity (p<0.0001 for all). PAIR-IBD metrics correlated with manual inflammatory and crypt injury scores (rs=0.23-0.72) and global indices (rs=0.27-0.65). Up to 89% of inconclusive cases aligned with remission-like profiles based on multiple independent AI metrics.

Conclusion: PAIR-IBD provides spatially resolved, quantitative assessment of inflammation and epithelial injury in UC, improving disease stratification and resolution of equivocal histology, with potential to support scoring consensus and improve accuracy of histological endpoints in clinical trials.
Shimizu, A.; Imamura, K.; Yoshimura, K.; Atsushi, T.; Sato, M.; Harada, K.
Drug-induced liver injury (DILI) is an acute inflammatory liver disease caused not only by prescription and over-the-counter medications but also by health foods and dietary supplements. Typically, DILI patients recover once the causative substance is identified and discontinued. In contrast, autoimmune hepatitis (AIH) results from the immune-mediated destruction of hepatocytes due to a breakdown of self-tolerance mechanisms. Patients presenting with acute-onset AIH often lack characteristic clinical features, such as autoantibodies, and require prompt steroid treatment to prevent progression to liver failure. Liver biopsy currently remains the gold standard to differentiate acute DILI from AIH; however, general pathologists face significant diagnostic challenges due to overlapping histopathological features. This study integrates pathology expertise with deep learning-based artificial intelligence (AI) to differentiate DILI from AIH using histopathological images. Our AI model demonstrates promising classification accuracy (Accuracy 74%, AUC 0.81). This paper presents a detailed pathological analysis alongside AI methods, discusses the current model performance and limitations, and proposes directions for future improvements.
Aswolinskiy, W.; Wong, J. K. L.; Zapukhlyak, M.; Kindruk, Y.; Paulikat, M.; Aichmüller, C.
Digitizing large histopathology archives requires processing millions of scanned whole slide images that must be validated rapidly. Automated organ-of-origin classification can accelerate quality control and enable early detection of mislabeled specimens. We developed a deep learning model that classifies the organ of origin from H&E-stained slides using a single low-resolution thumbnail per slide in under one second. For training, we used thumbnails from 16,624 slides from the TCGA and CPTAC archives, which contain mostly primary tumor resections. The images were categorized into 14 classes based on the most common primary sites in TCGA: Bladder, Brain, Breast, Colorectal, Kidney, Liver, Lung, Pancreas, Prostate, Skin, Stomach, Thyroid gland, Uterus, and Other (encompassing the remaining tissue types). We evaluated our approach on two independent external cohorts: a 5-class cohort with 2,857 slides (Colorectal, Kidney, Liver, Pancreas, Prostate) and a comprehensive 14-class cohort (12,348 slides). The model achieved 90% balanced accuracy for the 5-class cohort and 62% for the full 14-class cohort. Notably, when considering only the predictions with high confidence, 53% of the large cohort could be classified with 74% balanced accuracy. Manual review of high-confidence misclassifications suggested that some may reflect errors in the ground truth rather than model error. Mean model inference time was 0.2s per slide on an NVIDIA L4 GPU. Our deep learning approach demonstrates high classification performance with very low inference time, indicating its potential for real-time and cost-effective quality control in digital pathology.
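The confidence-gating result above (53% of slides classified at 74% balanced accuracy) reflects a standard trade-off: keep only predictions whose top softmax probability clears a threshold, and report coverage alongside accuracy on that subset. A minimal sketch; `confident_subset` and the 0.9 threshold are illustrative assumptions, not the authors' code.

```python
def confident_subset(probs, labels, threshold=0.9):
    """Keep only predictions whose max class probability meets the
    threshold; return (coverage, accuracy-on-kept-subset)."""
    scored = [(max(p), p.index(max(p)), y) for p, y in zip(probs, labels)]
    kept = [(pred, y) for conf, pred, y in scored if conf >= threshold]
    coverage = len(kept) / len(labels)
    accuracy = (sum(pred == y for pred, y in kept) / len(kept)
                if kept else 0.0)
    return coverage, accuracy

# hypothetical 2-class softmax outputs for four thumbnails
probs = [[0.95, 0.05], [0.55, 0.45], [0.10, 0.90], [0.60, 0.40]]
labels = [0, 1, 1, 0]
cov, acc = confident_subset(probs, labels, threshold=0.9)
```

Sweeping the threshold traces the coverage/accuracy curve, which is how one would pick an operating point for real-time quality control.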
Reitsam, N. G.; Gustav, M.; Jesinghaus, M.; Maerkl, B.; Foersch, S.; Kather, J. N.
Large language models (LLMs) are evolving into diagnostic co-pilots, yet current benchmarks fail to test the integrated, stepwise reasoning required in diagnostic pathology. Here, we present Pathology's Last Exam (PLE), a curated, highly detailed, text-based benchmark of 100 complex cases spanning organ systems, enriched for rare/challenging entities, plus 20 adversarial cases designed to stress-test model safety. Each case provides structured blocks (Primary, Clinical, Histopathology, IHC/Special Stains, Molecular Pathology) with stepwise information release mirroring real sign-out. We evaluated five LLMs (one proprietary, four open-source) across different stages. While the best model (GPT-5) achieved 70% accuracy on full evidence, performance on safety tests was alarming. Models frequently failed to detect biological contradictions, confidently diagnosing nonsensical "mix-up" cases rather than refusing them. This reveals a critical safety gap: high diagnostic capability is currently coupled with a dangerous inability to recognize impossible clinical scenarios. PLE provides a framework to measure and mitigate these risks before clinical deployment, as well as a foundation for developing multimodal evaluation protocols that can be extended to vision-language models and autonomous diagnostic agents in the future.
Jeong, W. C.; Kim, H. H.; Hwang, Y.; Hwang, G.; Kim, K.; Ko, Y. S.
The Updated Sydney System (USS) provides a standardized framework for grading gastritis and stratifying gastric cancer risk. However, subjective observer variability and labor-intensive workflows impede its routine clinical use. To address these challenges, we developed SydneyMTL, a multi-task deep learning framework that uses Multiple Instance Learning (MIL) with task-specific attention pooling to predict severity grades across all five USS attributes simultaneously. Trained on an unprecedented cohort of 50,765 whole-slide images (WSIs), SydneyMTL generates interpretable histologic evidence for clinical practice. In retrospective evaluations against 24 board-certified pathologists, the model achieved an overall mean lenient accuracy of 89.1%, with 22 pathologists exhibiting >80% agreement with the model. When evaluated on an expert-adjudicated "Golden dataset," the model's performance improved to 90.2%, demonstrating its capacity to align with multi-expert consensus and filter individual annotator noise. Latent space analysis confirmed that SydneyMTL captures the ordinal structure of the USS by representing disease severity as a continuous biological spectrum rather than as disjoint categories. Finally, a randomized crossover reader study showed that AI-assisted review significantly reduced interpretation time and improved inter-observer agreement, establishing SydneyMTL as a scalable tool for supporting standardized gastric cancer risk stratification.

Highlights:
- SydneyMTL is the first unified framework to simultaneously predict the full 4-tier severity grades across all five Updated Sydney System attributes.
- Trained on a massive cohort of 50,765 whole-slide images, the model aligns with multi-expert consensus on a rigorous "Golden dataset".
- AI assistance significantly reduces pathologist reading time and harmonizes inter-observer variability in real-world clinical workflows.
- Latent space analysis confirms that SydneyMTL preserves the biological ordinality of disease severity without explicit ordinal constraints.

The bigger picture: Gastritis is among the most frequent diagnoses in gastrointestinal pathology, and its histologic severity is central to gastric cancer prevention. In routine practice, pathologists convert subtle mucosal changes into semi-quantitative, ordinal grades using the Updated Sydney System, which evaluates five co-existing histologic dimensions. While this framework provides a shared language, grading is labor-intensive and inherently dependent on reader-specific thresholds, creating variability that affects risk stratification and surveillance. A key concept motivating our study is that gastritis is not defined by a single finding but by multiple criteria that co-occur and interact. This suggests that computational models should learn these criteria jointly, capturing their biological correlations and the continuum of severity, rather than treating each grade as an isolated classification task. SydneyMTL implements this perspective through a unified multi-task, weakly supervised approach that learns directly from a massive cohort of 50,765 routine whole-slide images. Beyond diagnostic accuracy, our work reveals that the model preserves the ordinality of severity in its representation space, supporting the biological view that discrete clinical categories approximate an underlying continuous biological spectrum. Its attention-based explanations also connect model outputs to interpretable tissue evidence, enhancing clinical trust. Crucially, by harmonizing inter-observer variability, SydneyMTL provides a more reliable foundation for gastric cancer risk assessment, ensuring that premalignant changes are captured with greater consistency. More broadly, our findings reposition AI for gastritis from narrow detection toward scalable, evidence-based decision support that can standardize grading practices and reduce cognitive burden on the global pathology workforce.
Niggemeier, L.; Hoelscher, D. L.; Herkens, T. C.; Gilles, P.; Boor, P.; Buelow, R.
Introduction: Kidney biopsy reports contain rich information that is clinically actionable and useful for research. However, the narrative format hinders scalable reuse. We investigated whether open-source large language models (LLMs) can extract relevant, standardized readouts from native kidney biopsy pathology reports.

Methods: German free-text native kidney biopsy reports were parsed with three open-source LLMs (Llama3 70B, Llama3 8B, MedGemma) to generate structured JSON outputs covering relevant report elements (e.g., diagnosis, glomerular counts, histopathological patterns). Two independent observers manually curated the same report elements; disagreements between the two were resolved by an experienced nephropathologist to create the final ground truth. Performance was assessed using strict and soft matching and summarized as accuracy. Inter-rater agreement was quantified using Cohen's and Light's kappa with 95% confidence intervals obtained via 1,000-fold bootstrapping.

Results: Llama3 70B achieved the highest overall accuracy (93.3% strict, 97.1% soft), followed by MedGemma. These larger models showed near-perfect performance for explicit and discrete variables and for positivity of immunohistochemistry markers, while accuracy decreased for report elements requiring interpretation (e.g., primary diagnosis, interstitial inflammation in fibrotic vs. non-fibrotic cortex). Human raters showed strong agreement for the primary diagnosis (κ = 0.74, 95% CI 0.64-0.84). Adding Llama3 70B or MedGemma as a third rater increased overall agreement (0.82, 95% CI 0.74-0.89 and 0.78, 95% CI 0.69-0.85, respectively), whereas Llama3 8B reduced it.

Conclusions: Open-source LLMs can accurately transform narrative nephropathology reports into a structured, machine-readable format, potentially supporting scalable retrospective cohort building. While some report elements can be extracted without supervision, interpretation-dependent elements should be supervised by a human observer.

Lay Summary: Retrospective data collection from nephropathology reports is essential for building informative cohorts in computational nephropathology research, yet manual processing of narrative reports is time-consuming and limits scalability. In this study, we demonstrate that open-source large language models can reliably extract key diagnostic, quantitative, and descriptive data elements from kidney biopsy reports with high accuracy. While factual and clearly stated report elements can be extracted automatically, findings that require contextual or interpretative judgment still benefit from expert supervision. Overall, this approach substantially reduces manual effort and enables efficient generation of structured datasets from diagnostic routine, facilitating the development of kidney registries and future computational nephropathology research. In addition, such systems could be implemented into the routine diagnostic workflow to directly transform narrative reports into structured data.
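The inter-rater statistics reported above (Cohen's kappa with bootstrap confidence intervals) follow a standard recipe that fits in a few lines; the sketch below is a generic illustration with toy ratings, not the study's analysis code.

```python
import random

def cohens_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n
    pe = sum((a.count(c) / n) * (b.count(c) / n) for c in set(a) | set(b))
    return 1.0 if pe == 1 else (po - pe) / (1 - pe)

def bootstrap_kappa_ci(a, b, n_boot=1000, seed=7):
    """Percentile 95% CI: resample report items with replacement and
    recompute kappa on each resample."""
    rng = random.Random(seed)
    n = len(a)
    stats = sorted(
        cohens_kappa([a[i] for i in idx], [b[i] for i in idx])
        for idx in ([rng.randrange(n) for _ in range(n)]
                    for _ in range(n_boot))
    )
    return stats[int(0.025 * n_boot)], stats[int(0.975 * n_boot)]

# toy example: two raters assigning one of two diagnosis codes
rater1 = [0, 1, 0, 1] * 8
rater2 = [0, 1, 1, 1] * 8
k = cohens_kappa(rater1, rater2)        # 0.5 for this pattern
lo, hi = bootstrap_kappa_ci(rater1, rater2)
```

Adding an LLM as a third rater, as in the study, would extend this to Light's kappa (the mean of pairwise Cohen's kappas across all rater pairs).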
Ayad, M. A.; McCortney, K.; Congivaram, H. T. S.; Hjerthen, M. G.; Steffens, A.; Zhang, H.; Youngblood, M. W.; Heimberger, A. B.; Chandler, J. P.; Jamshidi, P.; Ahrendsen, J. T.; Magill, S. T.; Raleigh, D. R.; Horbinski, C. M.; Cooper, L. A. D.
Meningiomas are the most common primary brain tumors and, despite their benign reputation, often behave aggressively. Meningiomas are morphologically heterogeneous, yet the full significance of their histologic diversity is unclear. This is in large part because many features are not readily quantifiable by traditional observer-based light microscopy. Molecular testing improves prognostic stratification, but is not universally accessible. We therefore sought to determine whether an artificial intelligence (AI)-trained program could predict specific genomic and epigenomic patterns in meningiomas, and whether it could extract more prognostic information out of standard hematoxylin and eosin (H&E) histopathology than the current WHO classification. To do this, we developed Morphologic Set Enrichment (MSE), an interpretable computational pathology framework that quantifies statistical enrichment of morphologic patterns, cells, and tissue architecture from H&E whole-slide images. The MSE meningioma histology program was able to accurately predict DNA methylation subtypes and concurrent chromosome 1p/22q losses, in the process identifying specific morphologic patterns associated with key genomic and epigenomic alterations. It also added prognostic value independent of standard clinical and pathological variables. These results demonstrate that AI-based quantitative morphologic profiling can capture clinically and biologically relevant information that redefines risk stratification for meningiomas, incorporating histological information not included in existing grading schemes.
Carr, L. L.; Sankaranarayanan, A.; Ha, K.; Rawlani, M.; Kazerouni, A. S.; Specht, J.; Kennedy, L. C.; Reiter, D.; Dintzis, S.; Hippe, D. S.; Kilgore, M. R.; Symonds, L.; Partridge, S. C.; Mittal, S.
Stromal tumor-infiltrating lymphocytes (sTILs) are promising biomarkers for predicting therapeutic outcomes in triple-negative breast cancer (TNBC), with higher sTIL levels correlating with improved chemotherapy response and survival outcomes. Currently, sTILs are manually evaluated by pathologists, which is prone to inter-reader variability. In this study, we have developed an AI-driven TIL segmentation pipeline that processes entire diagnostic hematoxylin-and-eosin-stained whole slide images for reproducible scoring (global TILseg scoring) and reliable prognostication. This pipeline was optimized and tested using two independent TNBC patient cohorts (n = 57 in the discovery cohort, n = 43 in the validation cohort) with clinical outcomes and follow-up data. The global scores generated by TILseg showed moderate to high concordance with expert scoring (Spearman R = 0.84-0.89) and improved patient stratification (p-value = 0.0191) as compared to manual scoring (p-value = 0.0663). Additionally, we investigated how the spatial localization of sTILs (spatial TILseg) impacts survival outcomes by identifying TILs in selected stromal subsets (0.02-2 mm from the epithelial clusters). Our findings show that TILs up to 50 µm from epithelial regions are the most prognostic for recurrence-free survival post-neoadjuvant chemotherapy, with higher statistical significance than both manual and global TILseg scoring. Further, spatial TILseg scoring was more significantly associated with pathological complete response status in both patient cohorts. In summary, we present an AI-based digital tool for robust sTIL scoring and spatial mapping to enhance its potential as both a diagnostic and prognostic biomarker, particularly in TNBC patients.
Significance: An automated and spatially resolved AI tool for sTIL scoring enhances patient risk stratification based on both response to treatment and recurrence-free survival, establishing its relevance as an independent prognostic marker.
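The distance-banded analysis above (counting TILs within a given radius of the nearest epithelial region, e.g. 50 µm) reduces to a nearest-neighbour query over detected cell coordinates. A brute-force toy version with hypothetical coordinates in µm; `tils_within_distance` is an assumption for illustration, and a real pipeline would use a spatial index (e.g. a KD-tree) over millions of detections.

```python
import math

def tils_within_distance(til_xy, epithelium_xy, max_dist_um):
    """Count TILs whose nearest epithelial point lies within
    max_dist_um micrometres (brute-force nearest-neighbour search)."""
    def nearest(p):
        return min(math.dist(p, e) for e in epithelium_xy)
    return sum(nearest(t) <= max_dist_um for t in til_xy)

# hypothetical detections: two epithelial anchor points, four TILs
epi = [(0.0, 0.0), (100.0, 0.0)]
tils = [(10.0, 0.0), (40.0, 30.0), (60.0, 80.0), (200.0, 0.0)]
n50 = tils_within_distance(tils, epi, 50.0)   # TILs within the 50 µm band
```

Repeating the count across bands (0.02-2 mm, as in the study) yields the spatial TILseg profile whose prognostic value is compared against global scoring.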
Abolfathi, H.; Maranda-Robitaille, M.; Lamaze, F. C.; Kordahi, M.; Armero, V. S.; Orain, M.; Fiset, P. O.; Joubert, D.; Desmeules, P.; Gagne, A.; Yatabe, Y.; Bosse, Y.; Joubert, P.
Background: Histologic descriptors such as lymphovascular invasion (LVI), visceral pleural invasion (VPI), spread through air spaces (STAS), and histologic grade have each been associated with adverse outcomes in lung adenocarcinoma (LUAD). However, with the exception of VPI, these features are not formally incorporated into the TNM staging system. We evaluated the prognostic value and incremental contribution of these histologic descriptors within the framework of the 9th edition TNM staging system.

Methods: In total, 1,745 individuals diagnosed with stage I-III invasive non-mucinous lung adenocarcinoma (NM-LUAD) were included in this study, comprising 1,139 French-Canadian patients who underwent surgical resection at IUCPQ-Universite Laval (discovery cohort) and 606 patients from the National Cancer Center Hospital in Tokyo, Japan (validation cohort). The objective was to assess the prognostic contribution of histologic descriptors, including STAS and LVI, as complements to conventional 9th edition TNM staging.

Results: Grade 3 tumors, LVI, and STAS were identified in 880 (50.4%), 809 (46.4%), and 775 (44.4%) of 1,745 cases, respectively. Histologic grade and LVI demonstrated the strongest associations, particularly in early-stage disease, while STAS exhibited a stage-dependent effect, being more impactful in stages II-III. VPI showed less consistent prognostic value. Incorporating these histologic descriptors into TNM staging improved prognostic model performance, with the largest gains driven by histologic grade and LVI, while STAS provided additional, complementary prognostic refinement.

Conclusion: These findings demonstrate that key histologic descriptors, including histologic grade, LVI, and STAS, represent robust and reproducible prognostic parameters. Importantly, these descriptors provide complementary, stage-dependent information that may enhance risk stratification and inform refinement of future TNM staging frameworks, including the forthcoming 10th edition.
Yoo, J.; Karthikeyan, R.; Kamat, K.; Chan, C.; Samankan, S.; Arbzadeh, E.; Schwartz, A.; Latham, P.; Chung, I.
Ductal carcinoma in situ (DCIS) is a non-invasive breast cancer spanning a biologic continuum from atypical ductal hyperplasia (ADH) to high-grade lesions with variable risk of progression to invasive ductal carcinoma (IDC), yet diagnostic accuracy remains limited when based on morphologic assessment via hematoxylin and eosin (H&E) alone. TRPV4, a mechanosensitive ion channel we previously demonstrated to exhibit pathology-dependent spatial distribution patterns in DCIS, offers a biologically motivated immunohistochemical (IHC) marker that may refine classification beyond routine H&E assessment. We evaluated whether deep learning models trained on TRPV4 IHC outperform those trained on H&E for DCIS classification. We assembled a multi-institutional dataset of paired H&E and TRPV4 IHC whole-slide images from 108 patients (24,248 image tiles), with both stains available for most cases in an internal development cohort (n=69) and an external test cohort (n=39). Each cohort was digitized on different scanners to assess cross-platform robustness. Tiles from annotated regions were grouped into four ordered classes reflecting DCIS progression: normal/benign, ADH/low-grade DCIS, high-grade DCIS, and IDC. Xception and EfficientNet-B0 convolutional neural networks were trained with patient-level 3-fold cross-validation on the development cohort and evaluated as ensembles on the test cohort. On external testing at the patient level, H&E-based ensembles showed moderate performance (macro-F1=0.43-0.44, macro-AUC=0.73-0.80), whereas TRPV4 IHC-based models substantially improved classification (macro-F1=0.68-0.72, macro-AUC=0.91-0.92). Across tile-level predictions, 68-79% of errors were between adjacent grades, consistent with an ordinal DCIS spectrum. 
Per-class tile-level analyses on the external test cohort showed the greatest improvement with TRPV4 IHC over H&E for ADH/low-grade DCIS (AUC 0.83-0.84 vs 0.70-0.81) and IDC (AUC 0.74-0.79 vs 0.65-0.66), supporting classification across the DCIS progression spectrum. These findings support TRPV4 IHC as a mechanistically grounded complement to H&E, improving deep learning-based DCIS classification in a pilot multi-institutional setting.
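The observation that 68-79% of errors fall between adjacent grades is a check on ordinal consistency that is simple to compute once the four classes are encoded 0-3 (normal/benign through IDC). A generic sketch with made-up labels; `adjacent_error_fraction` is an assumption, not the authors' evaluation code.

```python
def adjacent_error_fraction(y_true, y_pred):
    """Among misclassified tiles, the fraction whose predicted ordinal
    class differs from the truth by exactly one grade."""
    errors = [(t, p) for t, p in zip(y_true, y_pred) if t != p]
    if not errors:
        return 0.0
    return sum(abs(t - p) == 1 for t, p in errors) / len(errors)

# classes: 0=normal/benign, 1=ADH/low-grade DCIS, 2=high-grade DCIS, 3=IDC
y_true = [0, 1, 2, 3, 2, 1]
y_pred = [0, 2, 2, 1, 3, 1]
frac = adjacent_error_fraction(y_true, y_pred)   # 2 of 3 errors adjacent
```

A high adjacent-error fraction supports the paper's interpretation that model mistakes respect the underlying biologic continuum rather than being random across classes.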
Leyva, A.; Akbar, A. R.; Niazi, M. K. K.
Protein expression within oncogenic or suppressive pathways is a hallmark indicator of oncogenesis. While traditional AI models in digital pathology attempt to predict singular proteins, there is a need to predict the downstream expression of proteins to indicate the propagation of signals. RNA expression provides novel information but does not reveal the downstream propagation of protein signals or whether those signals are functional. Using Reverse Phase Protein Array (RPPA) data with whole-slide images (WSIs) from the publicly available Cancer Genome Atlas Breast Invasive Carcinoma dataset (TCGA-BRCA), we predict the expression of five key proteins identified from the apoptosis cascade, using DNA damage and repair (DDR) cascades as a biological control. Furthermore, we compare the performance of patch-level Vision Transformers (ViTs) on the regression task against the designed cellular-level ViT, CellRPPA. Our results demonstrate that patch-level vision transformers were unable to obtain statistically significant predictive results, achieving R-squared values < 0.1 for all folds. In addition, CellViT obtained R-squared values > 0.1 in all five test folds. We also show that morphologically indicative cascades, such as the apoptosis cascade, provide significantly higher performance compared to the DDR cascade.
Baumann, J.; Kanani, B.; Tamboli, S.; Kucherenko, Y.; Fritz, P.; Aswolinskiy, W.; Bosch, C.; Paulikat, M.; Wong, J. K. L.; Arora, B.; Zapukhlyak, M.; Eickmeyer, J.; Pavlova, M.; Laskorunskyi, R.; Kindruk, Y.; Kalteis, S.; Tamang, N.; Aichmüller-Ratnaparkhe, M.; Yazli, G.; Uluc, G.; Adam, P.; Quick, D.; Aichmüller, C.
Pathology faces persistent challenges including a global shortage of specialists, uneven access to expertise, increasing diagnostic complexity, and a growing need for second-opinion consultations. While digital and telepathology platforms address parts of this problem, existing solutions often trade accessibility for structured, workflow-aware clinical integration. At the same time, multimodal medical AI shows promise for diagnostic support but raises concerns regarding transparency, automation bias, and clinical accountability. We present PaiX Net, a structured, AI-augmented second-opinion platform designed to support collaborative pathology consultation while preserving human decision ownership. The platform integrates standardized case templates, moderated expert discussion, and human-centered AI assistance within a scalable, browser-based architecture compliant with data protection requirements. AI support is embedded at defined workflow stages to assist with case structuring, summarization, and exploratory interpretation, while diagnostic conclusions remain under expert control. To mitigate automation bias, AI-generated content is visually separated, collapsed by default, and presented only after independent expert input. PaiX Net incorporates a multimodal medical AI model (MedGemma-4B), selected for its open availability and computational efficiency, and fine-tuned on curated, anonymized consultation cases. An illustrative retrospective evaluation demonstrates substantial reductions in case preparation time and modest but consistent improvements in diagnostically relevant summaries. PaiX Net illustrates how structured, human-centered AI integration can enhance access to expert second opinions while maintaining clinical accountability and supporting continuous human-AI learning in digital pathology.
Kheiri, F.; Rahnamayan, S.; Makrehchi, M.
Bias in machine learning is a persistent challenge because it can create unfair outcomes, limit generalization, and reduce trust in real-world applications. A key source of this problem is shortcut learning, where models exploit signals linked to sensitive attributes, such as data source or collection site, instead of relying on task-relevant features. To tackle this, we propose the Deceptive Signal metric, a novel quantitative measure designed to assess the extent of a model's reliance on hidden shortcuts during the learning process. This metric is derived via the Deceptive Bias Detection pipeline, which isolates shortcut dependence by contrasting model behavior under two controlled conditions: (1) Full Exclusion, where a sensitive subgroup is completely removed from training; and (2) Partial Exclusion, where the model has limited access to specific classes within the subgroup. By calculating the behavioral shift between these settings, the Deceptive Signal metric provides a concrete value representing the model's proneness to learning task-irrelevant patterns. In experiments with the TCGA histopathology dataset, our metric successfully quantified strong dependencies on center-specific artifacts in models trained for cancer classification.
Author summary: Deep learning models are becoming powerful tools in healthcare, but they often suffer from a critical vulnerability: they can get the right answer for the wrong reason. In medical imaging, an AI might correctly identify a tumor not by analyzing the tissue, but by recognizing irrelevant digital markers unique to the specific hospital or scanner that produced the image. This phenomenon, known as shortcut learning, makes AI systems appear accurate at first glance while remaining unreliable for real-world patient care. To solve this, our research moves beyond simple accuracy checks and introduces a specific quantitative metric for shortcut learning.
We developed a testing framework that forces the model into controlled training scenarios, deliberately withholding specific "shortcut" information to see how the model reacts. By mathematically comparing the model's behavior across these scenarios, we calculate a precise score that indicates the magnitude of the model's dependence on irrelevant patterns. This metric allows us to put a concrete number on a model's trustworthiness, ensuring that medical decisions are driven by biology, not background noise.
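The behavioral-shift idea behind the Deceptive Signal metric can be illustrated with a toy calculation. This is a minimal sketch under the assumption that the shift is an average per-class accuracy difference between the two exclusion settings; the paper's exact formula is not given in the abstract, and all numbers below are invented.

```python
# Toy sketch of a Deceptive-Signal-style score: the behavioral shift
# between Full Exclusion and Partial Exclusion training conditions.
# Accuracy values are placeholders, not results from the paper.

def deceptive_signal(acc_full_exclusion, acc_partial_exclusion):
    """Mean absolute per-class accuracy shift between the two settings.

    acc_full_exclusion: per-class accuracy when the sensitive subgroup
        (e.g. one collection site) was fully removed from training.
    acc_partial_exclusion: per-class accuracy when the model saw only
        some classes from that subgroup.
    A large shift suggests reliance on subgroup-specific shortcuts.
    """
    classes = acc_full_exclusion.keys() & acc_partial_exclusion.keys()
    shifts = [abs(acc_partial_exclusion[c] - acc_full_exclusion[c])
              for c in classes]
    return sum(shifts) / len(shifts)

# A model whose held-out accuracy jumps once it gains partial access to
# one site's slides would be flagged as shortcut-prone.
full = {"tumor": 0.62, "normal": 0.70}
partial = {"tumor": 0.88, "normal": 0.74}
score = deceptive_signal(full, partial)
print(round(score, 2))  # 0.15
```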
Bonn, S.; Zimmermann, M.; Sauter, G.; Bengtsson, E.; Huber, T. B.; Baumbach, J.; Lennartz, M.; Fuhlert, P.; Witte, A.
Background: Vision Foundation Models (VFM) have emerged as a promising approach for computational pathology, offering scalable feature representations that may reduce labelled-data requirements and improve robustness to variation in tissue preparation and digitisation. However, VFM decoder and dataset size requirements as well as the performance under real-world domain shifts remain unclear. Methods: We evaluated six contemporary VFMs on a protocol-variant Prostate Cancer (PCa) dataset comprising 37 683 tissue microarray spot images from 10 412 patients. The dataset includes six controlled domain shifts arising from differences in staining duration, section thickness, scanner type, and sampling location. Two clinically relevant downstream tasks were examined: ISUP grading and 5-year relapse prediction. We compared two decoder architectures, quantified dataset-size requirements using a saturation analysis (45-5727 samples), and assessed cross-domain robustness using out-of-domain test sets. Findings: Larger VFMs consistently outperformed smaller models in peak accuracy and robustness metrics. Contrary to expectations of data efficiency, all models showed strong dependence on training-set size, requiring at least 1000 samples to approach stable results. All VFMs showed notable degradation under protocol-level domain shifts, with performance reductions of 4 to 13 percentage points in both cancer grading and relapse prediction, although larger models exhibited somewhat greater robustness. Furthermore, KNN-based probing performed substantially worse than a decoder-based approach across all architectures. Interpretations: Despite their strong representational capacity, current VFMs do not yet provide reliable domain generalisation or data-efficient performance in computational pathology. Decoder design remains essential, and substantial amounts of labelled data are still required to achieve clinically meaningful accuracy.
Further advances in pre-training strategies, decoder architectures, and domain adaptation methods will be crucial for translating VFMs into robust clinical tools. Research in context. Evidence before this study: This study focuses on pathology foundation models, which offer promising improvements in performance, data requirements, and robustness to domain shifts for computational pathology. To identify relevant studies, we searched in Google Scholar for research published before 1 April 2025. We searched for studies introducing novel foundation models trained on pathology images, or reviews comparing those models in terms of performance or robustness. The search terms used were computational pathology, benchmarking, review and foundation model, as well as combinations of these terms. We found that many studies focus on increasing the complexity of pathology foundation models while using increasingly extensive and heterogeneous pre-training datasets. Various benchmarking studies demonstrate the superior performance and robustness of more recent and larger foundation models. However, these studies have limitations in their evaluation datasets. Either they cover only a domain shift due to a different scanner device, or they have small sample sizes. We also identified a research gap regarding the requirement for large datasets to train a decoder based on a pathology foundation model for a specific downstream task. Added value of this study: The goal of this study was to evaluate the necessity of large downstream task datasets and the domain shift robustness of multiple pathology foundation models. For this purpose, we used our internal protocol-variant prostate cancer dataset, which provides a controlled evaluation setup, as multiple domain shift types have been intentionally and separately introduced for different sub-datasets. Our saturation analysis revealed that at least 1000 samples were necessary to achieve good performance.
Furthermore, our findings show that none of the evaluated foundation models are robust against all of our domain shifts, though larger models generally perform better. Implications of all the available evidence: This study reveals that increasing the capacity of pathology foundation models improves performance and robustness. However, we demonstrated that all models exhibit some degree of performance degradation for certain domain shifts and require substantial datasets for training on downstream tasks. These limitations demonstrate that pathology foundation models do not fully address the issues of robustness and data requirements.
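The contrast between KNN-based probing and a trained decoder on frozen embeddings can be sketched with a toy example. This is a stdlib-only illustration under simplifying assumptions: 2-D points stand in for foundation-model embeddings, and a nearest-centroid classifier stands in for a small trained decoder; it does not reproduce the study's setup.

```python
# Minimal sketch: KNN probing vs a trained-decoder stand-in (nearest
# centroid) on frozen "embeddings". Toy 2-D vectors replace real
# foundation-model features.

def knn_predict(train, labels, x, k=3):
    # Classify x by majority vote among its k nearest training embeddings.
    order = sorted(range(len(train)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(train[i], x)))
    votes = [labels[i] for i in order[:k]]
    return max(set(votes), key=votes.count)

def centroid_predict(train, labels, x):
    # Decoder stand-in: assign x to the class with the closest mean embedding.
    groups = {}
    for emb, lab in zip(train, labels):
        groups.setdefault(lab, []).append(emb)
    def dist_to_mean(lab):
        pts = groups[lab]
        mean = [sum(dim) / len(pts) for dim in zip(*pts)]
        return sum((a - b) ** 2 for a, b in zip(mean, x))
    return min(groups, key=dist_to_mean)

train = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (0.9, 1.1)]
labels = ["benign", "benign", "tumor", "tumor"]
print(knn_predict(train, labels, (0.95, 1.0)))       # tumor
print(centroid_predict(train, labels, (0.05, 0.1)))  # benign
```

A real decoder would be a trained network head, and the study's saturation analysis would vary the number of labelled training samples for it; the sketch only fixes the two probing styles being compared.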
Hu, Y.; Batchkala, G.; Gaitskell, K.; Domingo, E.; Li, B.; Zhang, T.; Li, Z.; Friedrich, M.; Woodcock, D.; Verrill, C.; Rittscher, J.
Computational-pathology foundation models (PFMs) have demonstrated remarkable accuracy in a wide range of whole-slide image (WSI) analyses, yet their morphological reasoning and potential biases remain opaque. Here we introduce an attention-shift monitoring framework that tracks tissue-level attention influx and efflux before and after fine-tuning a slide-level aggregator. We apply our interpretable framework across five clinically relevant tasks (lymph-node metastasis detection, lung-cancer subtyping, ovarian-cancer drug-response prediction, colorectal-cancer molecular classification and Marsh grading of colitis). We compare two market-validated PFMs, UNI and prov-GigaPath, using dynamically pooled, compressed embeddings under identical running conditions. Although both models achieve comparable ROC-AUC and balanced-accuracy scores, their attention-shift trajectories diverge sharply: each exhibits broad attention efflux from most tissue regions and highly concentrated, yet minimally overlapping, influx into distinct phenotypic zones. The attention heterogeneity in zero-shot mode and the inconsistency of post-tuning attention shifts indicate that interpretability depends primarily on each model's intrinsic feature priors rather than on accuracy or fine-tuning. Our findings uncover a systemic stability gap in PFM interpretability, masked by high performance metrics, and underscore the need for richer explanation tools, bias-monitoring protocols and diversified pre-training strategies to ensure safe clinical deployment.
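Region-level attention influx and efflux can be illustrated with a toy calculation, assuming per-region attention mass is available before and after fine-tuning. Region names and values below are illustrative, not from the study.

```python
# Sketch of attention influx/efflux: the per-region change in attention
# mass between the zero-shot and fine-tuned aggregator. Positive change
# is "influx" into a region, negative change is "efflux" out of it.

def attention_shift(before, after):
    """Return (influx, efflux) dicts of per-region attention change."""
    influx, efflux = {}, {}
    for region in before:
        delta = after[region] - before[region]
        (influx if delta > 0 else efflux)[region] = round(delta, 3)
    return influx, efflux

before = {"tumor": 0.30, "stroma": 0.40, "immune": 0.30}
after  = {"tumor": 0.65, "stroma": 0.20, "immune": 0.15}
influx, efflux = attention_shift(before, after)
print(influx)  # {'tumor': 0.35}
print(efflux)  # {'stroma': -0.2, 'immune': -0.15}
```

The pattern the abstract describes, broad efflux from most regions with concentrated influx into a few zones, would show up here as many small negative deltas and one or two large positive ones.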
Leyva, A.; Ackbar, A. R.; Niazi, M. K. K.
Replication timing is a costly but powerful tool for characterizing cellular mechanisms that underlie chromatin organization, cancer epigenetics, and genomic instability. Genome-wide replication timing profiles reflect the temporal order of DNA synthesis during S phase and are closely linked to chromatin accessibility, transcriptional activity, and proliferative state. Prior work has demonstrated a robust inverse relationship between replication timing and DNA methylation at the domain scale, enabling methylation-based proxies to approximate replication timing in large cohorts where direct experimental measurement is impractical. Given the tight coupling between replication timing, chromatin structure, and cellular phenotype, we hypothesized that histologic morphology encodes information consistent with replication timing states. To test this hypothesis, we implemented Vision Transformer (ViT) architectures to predict replication timing proxies from whole-slide histopathology images. Patch-level embeddings were extracted using a pretrained ViT and aggregated through attention-based multiple instance learning to predict sample-level replication timing. In parallel, CellViT models were employed to perform cell-level prediction, enabling direct comparison between patch-based and cellular representations. Across both modeling strategies, statistically significant correlations were observed between image-derived features and methylation-based replication timing proxies. Patch-level models achieved correlations of approximately 46%, while cell-level models consistently reached correlations exceeding 50% on the validation cohort. Prediction error remained stable across folds, with mean absolute error and mean squared error values ranging between 0.4 and 0.6. These results demonstrate that replication timing-associated epigenomic states are reflected in tissue morphology and can be inferred using deep learning models applied to routine histopathology. 
This work establishes a feasible, noninvasive framework for replicosomic inference from whole-slide images and supports future efforts toward spatially resolved replication timing analysis and integrative modeling of replicative stress in cancer.
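The attention-based multiple instance learning aggregation mentioned above can be sketched in a few lines: per-patch attention scores are softmax-normalized and used to weight patch embeddings into one sample-level vector. This is a generic MIL-pooling sketch; the study's attention network is learned, and the scores and embeddings below are toy values.

```python
# Minimal attention-based MIL pooling: softmax over raw patch scores,
# then a weighted sum of patch embeddings -> one sample-level vector.
import math

def attention_pool(embeddings, scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]   # softmax attention weights
    dim = len(embeddings[0])
    pooled = [sum(w * emb[d] for w, emb in zip(weights, embeddings))
              for d in range(dim)]
    return weights, pooled

# Three toy patch embeddings; the first patch gets a higher raw score.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = [2.0, 0.0, 0.0]
weights, pooled = attention_pool(embeddings, scores)
print([round(w, 2) for w in weights])  # [0.79, 0.11, 0.11]
```

In the study's setting, `pooled` would feed a regression head predicting the sample-level replication timing proxy.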
van den Berg, N.; Schoenpflug, L.; Horeweg, N.; Volinsky-Fremond, S.; Barkey-Wolf, J.; Andani, S.; Lafarge, M. W.; Oertft, G.; Jobsen, J. J.; Razack, R.; Gerestein, K.; Jonges, T.; de Kroon, C. D.; Nout, R.; Tseng, D.; Kuijsters, N.; Powell, M. E.; Khaw, P.; Shepherd, L.; Leary, A.; de Boer, S. M.; Kommoss, S.; van den Heerik, A. S. V. M.; Haverkort, M. A. D.; Church, D.; de Bruyn, M.; Smit, V. T. H. B. M.; Steyerberg, E.; Creutzberg, C. L.; Koelzer, V. H.; Bosse, T.
POLE sequencing for somatic mutations (POLEmut) guides adjuvant therapy in endometrial cancer (EC), but cost and infrastructural considerations lead to limited uptake. Omission of POLE testing leads to unnecessary exposure to radiotherapy and/or chemotherapy. We developed POLARIX, a multiple instance deep learning model with attention pooling, which predicts POLE mutation status from routine hematoxylin and eosin whole-slide images (WSIs). Trained on 2,238 cases from eleven EC cohorts, POLARIX showed clinical-grade discrimination across three external cohorts (Pooled: AUC=0.95, 95% CI: 0.91-0.98; n=68/481 POLEmut/POLEwt). Attention maps highlight POLE morphologies. Clinical applicability is demonstrated using predefined thresholds based on three resource scenarios. The most sensitive threshold ("Low") yields a test reduction of 77% (73%-81%) (sensitivity: 93% (85%-99%), specificity: 89% (87%-92%)). POLARIX is an interpretable and cost-efficient approach to reduce POLE testing in women with endometrial cancer, broadening access to precision oncology.
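The threshold-based triage logic behind POLARIX's "test reduction" can be illustrated with toy arithmetic: cases scoring below the threshold skip sequencing, and sensitivity/specificity are computed against true POLE status. The scores and labels below are invented, and this sketch ignores confidence intervals and the paper's three predefined resource scenarios.

```python
# Illustrative triage metrics at a probability threshold: cases below
# the threshold are spared POLE sequencing ("test reduction"); cases at
# or above it proceed to confirmatory testing.

def triage_metrics(scores, labels, threshold):
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    tn = sum(1 for s, y in zip(scores, labels) if s < threshold and not y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    test_reduction = (tn + fn) / len(scores)  # fraction not sequenced
    return sensitivity, specificity, test_reduction

scores = [0.9, 0.8, 0.2, 0.1, 0.05, 0.7, 0.15, 0.3]
labels = [1, 1, 0, 0, 0, 0, 0, 0]  # 1 = true POLEmut
sens, spec, red = triage_metrics(scores, labels, threshold=0.5)
print(round(sens, 2), round(spec, 2), round(red, 3))  # 1.0 0.83 0.625
```

Lowering the threshold trades specificity (more unnecessary sequencing) for sensitivity (fewer missed POLEmut cases), which is exactly the trade-off the paper's "Low" scenario resolves toward sensitivity.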